Add PTX vector memory intrinsics #4
|
Awesome! I'll take a look asap. |
|
Is there anything I could help with? I mean, adding more tests, for example? |
- Add PTXMemory class (ILGPU.Algorithms.PTX) with ld.v2/v4.f32 and st.v2/v4.f32 intrinsics; Float2/Float4 structs
- Add ArrayView LoadVectorized/StoreVectorized/CastAligned extension helpers
- Revert CudaAccelerator.DefaultMaxRegistersPerThread default from 255 to 0 (restores occupancy on normal kernels)
- Remap System.Numerics.BitOperations to hardware-backed IntrinsicMath methods (CLZ/PopC/CTZ)
- Add CUDA-only unit tests for all new PTX vector memory variants
- Bump ILGPU/ILGPU.Algorithms fork to 2.0.7; SpawnDev.ILGPU to 4.9.6-local.1

Addresses ilehtoranta Discussion #5 and PR #4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Merged in commit 2ec94d6, shipping in 4.9.6 (currently published to my local feed as Applied changes:
The Added CUDA-only unit tests covering all five variants (ld.v2.f32, ld.v4.f32, st.v2.f32 from struct, st.v2/v4.f32 from scalar args, and the Closing as manually applied. Thank you for the well-structured contribution — the PTX code generators followed the existing ILGPU pattern exactly, easy to drop in. |
|
Thanks! The AI made this easy =) |
Summary
Adds PTX-only vector memory intrinsics for explicit `f32` vector load/store code generation.

This introduces:
- `PTXMemory.LoadF32x2` / `StoreF32x2`
- `PTXMemory.LoadF32x4` / `StoreF32x4`
- `Float2` and `Float4` helper structs
- `ArrayView` convenience helpers

The main use case is CUDA kernels that need predictable vector memory instructions instead of relying on backend inference from ordinary scalar or struct access patterns.
Details
The new PTX intrinsics generate explicit PTX vector memory operations:
- `ld.v2.f32`
- `st.v2.f32`
- `ld.v4.f32`
- `st.v4.f32`

For `f32x4`, ptxas can lower these to 128-bit global memory instructions such as `LD.E.128` and `ST.E.128` when alignment and addressing are suitable.

This is useful for performance-sensitive kernels that operate on adjacent float values.
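
To make the alignment condition concrete, here is a minimal host-side sketch in C++. It assumes the usual PTX rule that a vector access must be aligned to the full vector size: 8 bytes for `v2.f32` and 16 bytes for `v4.f32`. The helper `is_vector_aligned` is hypothetical and purely illustrative; it is not part of ILGPU or of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an ILGPU API): returns true when the element at
// `elem_index` in a float buffer starts at an address that is a multiple of
// the full vector width, i.e. 8 bytes for 2 lanes and 16 bytes for 4 lanes.
inline bool is_vector_aligned(const float* base, std::size_t elem_index,
                              std::size_t lanes) {
    const auto address = reinterpret_cast<std::uintptr_t>(base + elem_index);
    return address % (lanes * sizeof(float)) == 0;
}
```

With a 16-byte-aligned buffer, element index 2 sits at byte offset 8. Under this check it is a valid start for a two-lane access but not for a four-lane one, so a kernel taking the `f32x4` path would need element indices that are multiples of 4.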